<!--
   Copyright 2024 The Google Research Authors.
  
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
  
       http://www.apache.org/licenses/LICENSE-2.0
  
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->

<!DOCTYPE html>
<meta charset="utf-8">
<title>User guide for Anthea</title>
<html>
<body>
  <h2>
    Anthea: A translation quality evaluation tool
  </h2>

  <h3>
    Summary
  </h3>
  <ul>
    <li>Projects are created by opening "Project TSV files" that have
      four fields: source and target segments, document name, system name.</li>
    <li>Several variants of MQM templates are supported.</li>
    <li>State is automatically saved in the browser's local storage.</li>
    <li>After evaluating, projects can be saved as "Evaluation JSON files".</li>
    <li>Evaluation JSON files can be opened for review.
      <a href="https://github.com/google-research/google-research/tree/master/marot">Marot</a>-based
      analysis gets shown in a tab if there are any MQM evaluations
      in the opened files.</li>
  </ul>

  <h3>
    Manage projects
  </h3>
  <p>
    You can manage the projects through the dropdown menu after clicking on <b>Anthea</b> in the top
    left corner. The following operations are supported, in the same order as presented in the menu:
  </p>
  <ul>
    <li><b>Create</b> a new project</li>
    <li><b>Load</b> an existing project</li>
    <li><b>Download</b> evaluations as a JSON file</li>
    <li><b>View</b> the evaluation JSON file</li>
  </ul>

  <h4>
    Create a new project
  </h4>
  <p>
    A new project is created by selecting an evaluation template and opening a project TSV file.
    Once created, the project setup is displayed in the top right corner.
  </p>

  <p>
    The TSV file must have a header line. The header line should be the JSON encoding of an
    object that identifies the "source_language" and "target_language". It may include other
    metadata fields as needed. Example:
    <blockquote>
      <code>
      {"source_language": "[language-code-BCP47]", "target_language": "[language-code-BCP47]"}
      </code>
    </blockquote>
  </p>

  <p>
    After the header line, the TSV file must contain these four fields in the following order, on
    every subsequent line:
  </p>
  <ul>
    <li>source segment</li>
    <li>target segment</li>
    <li>document name</li>
    <li>system name</li>
  </ul>
  <p>
    A fifth field is optional and contains any additional annotations in the
    form of a JSON-encoded object. The object may carry certain specific keyed
    information, including <b>references</b> (a mapping from names of reference
    translations to reference translations) and <b>primary_reference</b> (the
    name of the primary reference, required if "references" is present).
    A <b>segment</b> (also referred to as a "sentence group") is a group of
    sentences that will be incrementally shown for evaluation. Typically, a
    segment consists of one sentence or one paragraph.
  </p>

  <p>
    In addition, the data in the TSV file is interpreted in these special ways:
  </p>
  <ul>
    <li>
      Blank lines indicate paragraph breaks between segments. Lines containing just
      two fields (presumably document name and system name) are also treated the same
      as blank lines.
    </li>
    <li>
      The text in "source_segment" and "target_segment" is unescaped, to replace any
      "\\n" with "\n" and any "\\t" with "\t".
    </li>
    <li>
      Within a segment (source or target), two (or more) consecutive newlines at the
      end of a sentence (see below for how sentence boundaries are marked) get rendered
      as a paragraph break. A single newline at the end of a sentence gets rendered as a
      line-break.
    </li>
  </ul>

  <h5>
    Word and sentence boundaries
  </h5>
  <p>
    Anthea does word segmentation by using two splitting characters: space and
    zero-width space ('\u200b'). Many use cases are simply handled by space
    segmentation alone. For CJK languages, and for segmenting out punctuation, etc.,
    you should use some external tool to do the segmentation, and insert zero-width
    space characters in the TSV files provided to Anthea.
  </p>
  <p>
    Two consecutive zero-width spaces ('\u200b\u200b') can be used to indicate sentence
    boundaries in text.
  </p>

  <h4>
    Load an existing project
  </h4>
  <p>
    Past projects are saved in the browser's local storage, and can be loaded again by selecting the
    same project setup as when first created.
  </p>

  <h4>
    Download evaluations as a JSON file
  </h4>
  <p>
    You can save the evaluations as a JSON file at any time during evaluation. You can optionally
    delete the evaluations from the browser's local storage after download.
  </p>

  <h4>
    View the evaluation JSON file
  </h4>
  <p>
    You can view any past evaluation by opening its JSON file. The evaluations will be read-only.
  </p>

  <h3>
    Evaluation templates
  </h3>
  <p>
    Here are the templates currently available:
  </p>
  <dl>
    <dt>MQM</dt>
    <dd>This is the main MQM template.</dd>

    <dt>MQM-Target-First</dt>
    <dd>This is an MQM template in which the target side of a segment is shown
      first.</dd>

    <dt>MQM-CD</dt>
    <dd>This is a simplified MQM template with only "accuracy" and "fluency"
      as error types.</dd>

    <dt>MQM-WebPage</dt>
    <dd>This is a simplified MQM template suitable for evaluating translation
      quality of pieces of text from web pages. It requires image renderings
      of the source web page and its translation (the images are shown as
      context to the raters).</dd>

    <dt>MQM-Monolingual</dt>
    <dd>This is an MQM template where only the target side is
      evaluated.</dd>
  </dl>
  <p>
    You can create a new template as follows. Use a new unique name for the template (say
    <code>'MQM-My-Template-Name'</code>). Add this name to the list of template choices in
    <code>anthea.html</code> like this (please keep the names sorted):
  </p>
  <blockquote>
    <pre>
      &lt;option value="MQM"&gt;MQM&lt;/option&gt;
      ...
      &lt;option value="MQM-My-Template-Name"&gt;MQM-My-Template-Name&lt;/option&gt;
      ...
    </pre>
  </blockquote>
  <p>
    Copy one of the existing template files (among <code>template-*.js</code>) into a new file
    called <code>mqm-template-my-template-name.js</code>, and modify it so that it sets up the new
    template like this:
  </p>
  <blockquote>
    <pre>
      antheaTemplates['MQM-My-Template-Name'] = {
        ...
      };
    </pre>
  </blockquote>

</body>
</html>
